Explore the dataset

In this notebook, we will perform exploratory data analysis (EDA) on the processed Waymo dataset (the data in the processed folder). In the first part, you will create a function to display an image and its bounding boxes.

Write a function to display an image and the bounding boxes

Implement the display_instances function below. This function takes a batch as input and displays an image with its corresponding bounding boxes. The only requirement is that the classes are color coded (e.g., vehicles in red, pedestrians in blue, cyclists in green).
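One possible shape for this function is sketched below. It is not a reference solution. It assumes that a batch is a dict of NumPy arrays with the keys "image", "groundtruth_boxes" (normalized [ymin, xmin, ymax, xmax]) and "groundtruth_classes", and that the class ids are 1 = vehicle, 2 = pedestrian, 4 = cyclist. None of these details are stated above, so adjust them to your data. The helper boxes_to_rectangles is a hypothetical name introduced for the sketch.

```python
# Assumed label map (not stated in the text): 1 = vehicle, 2 = pedestrian, 4 = cyclist.
CLASS_COLORS = {1: "red", 2: "blue", 4: "green"}


def boxes_to_rectangles(boxes, classes, height, width):
    """Convert normalized [ymin, xmin, ymax, xmax] boxes to pixel rectangles.

    Returns a list of (x, y, box_width, box_height, color) tuples.
    """
    rectangles = []
    for (ymin, xmin, ymax, xmax), cls in zip(boxes, classes):
        x, y = xmin * width, ymin * height
        w, h = (xmax - xmin) * width, (ymax - ymin) * height
        # Unknown class ids fall back to a neutral color.
        color = CLASS_COLORS.get(int(cls), "yellow")
        rectangles.append((x, y, w, h, color))
    return rectangles


def display_instances(batch):
    """Show one image with its color-coded bounding boxes."""
    # Imported here so the helper above stays usable without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle

    image = batch["image"]  # assumed key, shape (height, width, channels)
    height, width = image.shape[:2]
    _, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(image)
    rectangles = boxes_to_rectangles(
        batch["groundtruth_boxes"], batch["groundtruth_classes"], height, width
    )
    for x, y, w, h, color in rectangles:
        ax.add_patch(Rectangle((x, y), w, h, fill=False, edgecolor=color, linewidth=1))
    ax.axis("off")
    plt.show()
```

Keeping the coordinate conversion separate from the plotting makes it easy to check the box geometry without opening a figure.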

Display 10 images

Using the dataset created in the second cell and the function you just coded, display 10 random images with the associated bounding boxes. You can use the methods take and shuffle on the dataset.
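A minimal sketch of the sampling step follows. The helper name sample_batches and the buffer size of 1000 are assumptions made for illustration. The function only relies on the dataset exposing shuffle(buffer_size) and take(count), which tf.data.Dataset does.

```python
def sample_batches(dataset, count=10, buffer_size=1000):
    """Return `count` elements of `dataset` drawn in shuffled order.

    `dataset` must expose `shuffle(buffer_size)` and `take(count)`,
    as `tf.data.Dataset` does. The default buffer size is arbitrary.
    """
    # Shuffle first, then take: the reverse order would always return
    # the first `count` elements, only in a different order.
    return dataset.shuffle(buffer_size).take(count)
```

Each element of the result can then be passed to display_instances. Note that with tf.data the shuffle is only approximate when the buffer is smaller than the dataset, so a larger buffer gives a more random sample at the cost of memory.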

Additional EDA

In this last part, you are free to perform any additional analysis of the dataset. What else would you like to know about the data? For example, think about the data distribution. So far, you have only looked at a single file...

Count number of entries in the dataset

Counting the entries tells us how large the dataset is and puts the later per-class statistics in context.
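As a rough sketch, the count can be obtained by iterating over the dataset once, since the size of a record-based dataset is often not known ahead of time. The helper name count_entries is hypothetical, and the sketch assumes the dataset is finite (no repeat() applied); otherwise the loop never ends.

```python
def count_entries(dataset):
    """Count the elements of any finite iterable by walking it once."""
    # Assumes a finite dataset; a repeated dataset would loop forever.
    return sum(1 for _ in dataset)
```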

Area and count of classes

Knowing how much of an image each class covers, and how many objects of each class appear in an image, can help us explore the data further.
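One way to compute both quantities for a single image is sketched below. It assumes the boxes are normalized [ymin, xmin, ymax, xmax], so that the area of a box is already a fraction of the image area. The function name class_stats is hypothetical, and overlapping boxes are counted twice, which is one possible reading of "relative area" among several.

```python
from collections import defaultdict


def class_stats(boxes, classes):
    """Return {class_id: (object_count, summed_relative_area)} for one image.

    Boxes are assumed to be normalized [ymin, xmin, ymax, xmax].
    """
    counts = defaultdict(int)
    areas = defaultdict(float)
    for (ymin, xmin, ymax, xmax), cls in zip(boxes, classes):
        counts[int(cls)] += 1
        # With normalized coordinates, height * width is a fraction of the image.
        areas[int(cls)] += (ymax - ymin) * (xmax - xmin)
    return {cls: (counts[cls], areas[cls]) for cls in counts}
```

Collecting these per-image values into a table (one row per image and class) gives the input for the plots that follow.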

Plotting the relative area of classes in an image

Box plots and strip plots are useful here for inspecting the distribution.

Conclusion from the above plot-

We can now try to filter out the outliers to explore this further-
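The text does not say which filtering rule is used. As an illustration only, the sketch below applies the common 1.5 x IQR rule, which matches the whiskers of a standard box plot. The function name filter_outliers and the parameter k are assumptions.

```python
import statistics


def filter_outliers(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (needs at least 2 values)."""
    # "inclusive" interpolates linearly between the observed data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]
```

Applying this per class, rather than to all areas at once, avoids large vehicles masking what counts as unusual for pedestrians or cyclists.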

Conclusion from the above plot-

Strip plots-

Strip plots can help us visualize the distribution better-

Conclusion from the above plot-

One important thing-

Visualize the count of classes in an image

Conclusion- There are a total of 352,694 vehicles, 103,664 pedestrians and 2,639 cyclists.
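Totals like these can be tallied in a single pass, and the same pass can keep the counts per image for later plots. The sketch below assumes each batch is a dict with a "groundtruth_classes" entry holding one class id per object. That key name and the function name count_classes are assumptions.

```python
from collections import Counter


def count_classes(batches):
    """Return (dataset-wide totals, list of per-image Counters)."""
    per_image = [
        Counter(int(cls) for cls in batch["groundtruth_classes"])  # assumed key
        for batch in batches
    ]
    # Adding Counters sums the counts class by class.
    totals = sum(per_image, Counter())
    return totals, per_image
```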

Frequency of classes in an image instead of the whole dataset-

How many classes are there in an image?

Conclusion of the above two plots-

Plot of images that have the minimum and maximum number of classes in them.

Are consecutive images in a dataset similar?

If consecutive images show the same scene with only minor changes, then shuffling individual images before splitting might put nearly identical images in the train, validation and test sets.

In that case, we have to split the dataset so that the same scene never appears in more than one of the train, validation and test sets.

Yes, they are almost the same. Hence, we have to be careful with the approach we take for splitting the dataset.
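One way to respect this is to split at the scene level instead of the image level. The sketch below assumes that each file in the processed folder holds the frames of exactly one scene, which should be verified for your data. The function name split_by_scene, the 75/15/10 ratios and the fixed seed are likewise illustrative choices.

```python
import random


def split_by_scene(scene_files, train=0.75, val=0.15, seed=0):
    """Assign whole scene files to train/val/test so no scene is shared."""
    files = sorted(scene_files)  # sort first so the result is reproducible
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train)
    n_val = int(len(files) * val)
    return (
        files[:n_train],
        files[n_train:n_train + n_val],
        files[n_train + n_val:],  # the remainder becomes the test set
    )
```

Shuffling of individual images can then happen inside each split without leaking near-duplicate frames across splits.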